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Abstract 

Expectile regression is a nice tool for investigating conditional distributions beyond the 
conditional mean. It is well-known that expectiles can be described with the help of the asym¬ 
metric least square loss function, and this link makes it possible to estimate expectiles in a 
non-parametric framework by a support vector machine like approach. In this work we de¬ 
velop an efficient sequential-minimal-optimization-based solver for the underlying optimization 
problem. The behavior of the solver is investigated by conducting various experiments and the 
results are compared with the recent R-package ER-Boost. 


1 Introduction 


In standard nonparametric regression analysis, most of the methods developed so far are based 
on the least square loss function for estimating conditional expectations. In many applications, 
however, it is required to study conditional distributions beyond means. A nice tool for this 
purpose was offered by [20] in the form of quantile regression, which allows both the location and 
the spread of the response variable to be studied by using asymmetric least absolute deviation loss 
function (ALAD). We refer the reader to [19, 37, 9, 33] and references therein, for details description 
and different estimation methods for quantile regression. Following the spirit of quantile regression, 
[21] proposed the asymmetric least square (ALS) loss function 


L r (t) 


rt 2 if t ^ 0 

(1 — r)t 2 if t < 0, 


( 1 ) 


to compute conditional expectiles, also called regression expectiles. These expectiles were found an 
interesting alternative to quantiles in many applications due to the computational advantages. For 
example, [3] used the expectile-order to determine the conditional ordering of individual values 
relative to other members of data sets, [31] developed an expectile-based technique to compute the 
distribution of treatment effects on the tail of the outcome variable in the presence of confounding 
mechanism, and [14] compared expectile regression with quantile regression for forecast evaluation 
under asymmetric loss functions and showed that expectile treatment effects provide more efficient 
estimates. There are some other areas where expectlies have been applied successfully, for instance, 
in demography, see [23] and in education, we refer to [29]. Moreover, in finance expectiles play an 
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important role for risk measures of financial asserts, see for instance [2, 15, 26, 42], For example, it 
has been shown recently that expectiles are the only coherent risk measures, see [5, 36]. Moreover, 
the frequently used expected shortfall (ES) is a conditional mean of a random variable given that 
it is less than a certain quantile. In other words, ES can be written as a function of both quantiles 
and expectiles, which requires to establish a connection between quantiles and expectiles.This leads 
to the expectile-based quantile estimates, which can be more efficient than empirical quantiles [41]. 
In this regard, recall that, there is one-to-one mapping of expectiles to quantiles that was explored 
by [12] and further supported by [1, 45, 38]. Moreover, [7] embedded both quantiles and expectiles 
in the general class of M-estimators by proposing asymmetric M-estimators. 

Some semiparametric and nonparametric expectile estimation methods have already been pro¬ 
posed in literature. For example, [24] considered penalized splines to compute smooth expectile 
estimates, [28] proposed a couple of different procedures including least asymmetrically weighted 
squares in combination with mixed models, boosting within an empirical risk minimization frame¬ 
work, and a restricted expectiles regression model. Moreover, [27] derived asymptotic properties of 
expectile regression estimates and used them to construct corresponding confidence intervals. Fur¬ 
thermore, a kernel method based on local linear fits was considered in [45], and a boosting method 
using regression trees was proposed in [44], Finally, two expectile regression packages, ER-Boost 
[44] and expectreg [30], have recently been made available. 

Another family of non-parametric estimation methods are the so-called kernel based regularized 
empirical risk minimizers, which include the well known support vector machines (SVMs) [39, p. 
138ff]. These kernel-based methods often enjoy state-of-the-art empirical performance, relatively 
simple implementations, and a high flexibility. Recall that their flexibility is based on two main in¬ 
gredients, namely the reproducing kernel Hilbert space (RKHS) H and the loss function L. Namely, 
the RKHS can be used to adapt to the nature of the input domain X, or more precisely, enables us 
to use both standard M n -valued data and non-standard data such as strings and graphs. Moreover, 
due to the so-called kernel-trick [25], the choice of H has little to no algorithmic consequences for 
solving SVM optimization problems. On the other hand, the choice of L determines the learn¬ 
ing goal [32, Chapter 3]. For example, the so-called hinge loss is used for classification, the least 
squares loss leads to conditional mean regression, and the ALAD is used to estimate quantiles. 
Unfortunately, however, different L lead to different optimization problems, which in turn require 
different solvers. For the above mentioned loss functions various solvers have been designed, see 
for example [8, 10, 13, 18, 37] and references therein for more detail, but besides [16], who consid¬ 
ered a kernelized iteratively reweighted strategy, no solver for the ALS has been proposed. In this 
paper, we derive a sequential minimal optimization (SMO) based solver, see [10] and particularly 
[22], for the ALS, which enables us to handle large data set efficiently. In addition, we consider 
different initialization methods and working set strategies in detail and validate them empirically, 
to further speed up the solver. Finally, we report some experiments that compare our solver with 
the ER-Boost package. 

The rest of the paper is organized as follows: Section 2 presents the formulation of the primal 
and the dual optimization problem of SVMs. Section 3 proposes an algorithm to perform one dual 
variable update per iteration along with the stopping criteria and initialization methods. The exact 
two dimensional optimization problem with some working set selection strategies is discussed in 
Section 4. Some experiments and discussion on the results can be found in section 5. Finally, the 
appendices contain proofs of theorems and lemmas, and detailed results from experiments. 
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2 Primal and Dual Optimization Problem 


Let us consider a training set D := ((aq, yi), (. xi , 2 / 2 ), • • • (x n , y n )) £ (X x M) n that is sampled from 
some unknown distribution P on X x Y", where X is an arbitrary set and Y C M. In addition, we 
assume that / : X —> M is a function and L : Y xR^ [0, 00 ) is an arbitrary convex loss function 
defined in (1). Then the goal of supervised statistical learning is to find a function / such that the 
risk 

K L ,p(f) '■= f L (yJ(x))dP{x,y ), 

JXxY 

is small. This means that 7 Zi,,p(f) has to be close to the optimal risk 

7 Z* L P := inf{TZi p(f)\f : X — > M measurable} , 

which is called the Bayes risk with respect to P and L. Since the data generating distribution P is 
unknown, we replace 7 £l,p(/) by its empirical counterpart 

1 n 

7Il,d(/):=-V%,/W). ( 2 ) 

n z — J 

i= 1 

Now, recall that the support vector machines (SVMs) solve the regularized problem 

Id ,a = argmin A||/||^ + 7 Il,dU) > ( 3 ) 

where A > 0 is a user specified regularization parameter and H is the reproducing kernel Hilbert 
space (RKHS) over X with reproducing kernel I::lxl->R, see e.g. [ 6 , 4, 32], For example, for 
input domains X C M rf , one often uses SVMs that are equipped with Gaussian radial basis (RBF) 
kernels. Recall that the latter are defined by 

fc 7 (x, x') := exp(— 7 2 ||x — x '\\\), x, a/G (4) 

where 7 > 0 is called the width parameter that is usually determined in a data-dependent way, 
e.g., by cross-validation. Note that is normalized, that is, k^(x,x) = 1 for all x G M rf , and all 
kernels we consider below are also normalized. By [32, Theorem 4.56], is also universal on every 
compact subset X G R n and in particular strictly positive definite. Furthermore, the RKHS TL 7 
induced by £ 7 , is dense in L p {y) [32, Chapter 4], where // is a finite measure of M n and p G [1, 00 ). 
Therefore, the following consistency result applies to Gaussian kernels. 

Theorem 1 . Let P be a distribution on X x M with f x y 2 P(dy\x)dPx(x) < 00 , L be the r- 
asymmetric least squares loss, and /£ p be the conditional r-expectile function. Moreover, let k be a 
bounded, measurable kernel whose RKHS is separable and dense in L- 2 {Px)- Then for all sequences 
\ n —> 0 with A fn —> 0 and all £ > 0, we have 

lim P n (D G {X X M) n : n L , P (f D ,x n ) - P* L P > e) = 0 , 

n—>■ 00 ’ 


and 

lim P n (D G (X x M) n : \\f D ,x n - fl , P ||o > e) = 0, 

where ||<?||o := /min{l, \g\}dPx is a translation-invariant metric describing convergence in proba¬ 
bility Px ■ 
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To deal with (3) algorithmically, we £x a feature space H$ and a feature map : X —> Hq of 
Then every / £ H can be represented by w € Hq via 


/(Xj) = (W,</>(x;)) , 

see [32, Theorem 4.21] for further details. Note that the latter theorem also shows that 

|H = inf{11w11 : w G H o with / = (w,</>(•))} , 


(5) 


( 6 ) 


where := X —> H is the canonical feature map from the input space to RKHS. Using (2) and (6) 
in the objective function (3), we obtain the standard regularized problem for SYMs without offset 


1 


arg min A||w||^ + - V Lfa, f(xi )). 
weH 0 u n ' 


(7) 


i =1 


If L is the hinge loss function, then it is shown by [35] that the SVM without offset not only faster 
but also achieves accuracy that is comparable to SVM with offset. One reason for the faster training 
time was that the offset leads to an additional equality constraint for the dual problem and as a 
consequence, SMO type solvers can only update certain pairs of dual variables. In addition, the 
offset makes it relatively expensive to calculate the duality gap [10], which may serve as a stopping 
criterion for these solvers. 

In the following, we will adapt the ideas of [35] to design a solver for (7) in the case of L being 
an asymmetric least squares loss. To this end, we first reformulate the primal optimization problem 
(7) such as 


arg min R C (w,£+,£_) := ^||w|| 2 + Ct^£ + + <7(1 - r) 
(w ^- } 2 
such that &,+ > Vi~ (w, </>(xj)), 

&> (w, </>(xj)) - yi , 

>0, \/i = l,...,n 


( 8 ) 


where C := > 0. Using standard Langrangian techniques, see e.g. [10, Chapter 6], one can 

easily obtain the dual optimization problem 


arg max W(a, f3) := (a — /3, y) - \{a - / 3,K(a - f3)} - -^-(a,a) - 
(a,/3) 2 4Gr 

oii > 0, f5i > 0 . V z — 1.. n 


4(7(1 - r) 


(13, P) 


(9) 


Here y is the n x 1 vector of labels and K is the n x n matrix with entries Kij := k(xi,Xj), i,j = 
1,..,, n. Note that (8) is a convex function as the loss function (1) is a convex suffices. Analogously, 
it is not hard to see that the dual optimization problem (9) is concave. This ensures the fulfillment 
of the strong duality assumptions [10, Chapter 5] and consequently, the primal optimal solution 
can be obtain from the dual optimal solution using the simple transformation, which is 


w := - A)0( x i 

i=i 


In addition, the quadratic nature of (9) allows us to solve it using the quadratic programming 
(QP) techniques. However, many QP techniques that are implemented to solve dual optimization 


4 



problems, for example, interior point methods [43, 25], are impractical for large scale problems. 
Decomposition methods, such as chunking [39] have been designed to handle this difficulty by 
breaking the optimization problem into smaller subproblems and solving them iteratively. The 
limiting case of decomposition methods is the Sequential Minimal Optimization (SMO) methods 
that optimizes two coordinates at each iteration [22] for SVMs with offset and hence, does not 
require storage of the entire kernel matrix. Section 4 presents this idea in more detail in view of 
expectile regression without offset. It is also worth noting that SVMs without offset allows us to 
develop an SMO type algorithm that performs one dual variable update per iteration as a starting 
point [35]. In the following section, we introduce this algorithm in details. 


3 One Working Set Solution 


Our goal in this section is to develop an SMO type algorithm that updates a single coordinate at 
each iteration. For this, we first compute one working set solution. Then we establish a rule to 
select a direction in which update should be performed, and a criterion to stop the algorithm. In 
the end, we present the procedures to initialize the coordinates. 

Let us first compute the gradients for cc* and /3, from (9) that will be used throughout this 
paper. For this, we take the partial derivatives of (9) w.r.t. a t and P % and obtain the following 


VW aj (a, f3) = (ej, y) - <e*,iL(a - /?)> - , 

VW ft (a,/3) = —(e*,y) + (e h K(a - P)) - 


( 10 ) 


We now recall [10, p. 131ff] and reformulate the dual objective function (9). For a,P £ M" and 
an index i £ {1,..., n}, we write a\ z := a — a t e t and /A* := P — foei where e* is the i-th vector of 
standard basis of M n . Now the basic calculus together with K t i = 1 for normalized kernels leads 
to the following dual objective function for the ID-problem 


IF(a^* + ajej,/3^'+/3jej) : — 


IF(a V \/3 V ) + (a* - A)(e*,y) - J(a* ■ 


Pi? 

4(7(1 — r) ' 


( 11 ) 


Taking partial derivative of (11) w.r.t. 
equations 


a.i and Pi and setting them to zero yields the system of 


biai - Pi = Ci , 

on - b- 2 Pi = Ci , 


( 12 ) 


where 


h 


b 2 


2Ct + 1 
2 Ct ’ 

2(7(1 - r) + 1 
2(7(1 - r) ’ 


Ci = (ei, y) - (ej,/i(« V ~ P'*)) = 


After solving (12), we obtain the global solution 


VlF ai (a, P) + b\ (a, a) - (e*, P). 


(13) 


b 2 - 
b\b 2 - 1 1 


Pt 


l-In 
b\b 2 - 1 1 


(14) 
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Note that b\, b 2 £ (1,00) for all C > 0 and r £ (0,1). Therefore, it is not hard to see from (14) that 
a* = f3* =0 if and only if c % = 0. On the other hand, for all Ci £ M \ {0}, (14) leads to the relation 


a „■ = — - 


1 — r 


Pi , 


(15) 


which implies that the global solution violates the constraints of the dual problem (9). In 

other words, the global maximum that is attained by (9) does not lie in the set of feasible vectors. 
The following general theorem describes the way to find the solution in this situation. 


Theorem 2. Let W : W n —> M be a concave and twice continuous differentiable function and 
A C M m be a closed convex set. Assume that there is exactly one a* £ W n with W'(a*) = 0. Then 
the following statements hold: 


i) For all a a* we have W(a*) > W(a). 

ii) If a* A, then there exists an a* £ 8A such that W(a k ) > W[a) for all a £ A. 


Theorem 2 says that either a* is the optimal feasible solution or there is an optimal feasible 
solution on boundary {(0,/%) : Pi > 0} U {(«j,0) : a* > 0}. Now (14) shows that we have 
exactly one value (a*, fS *) at which derivative vanishes and (15) shows that (a*, f >*) is not feasible. 
Consequently, we need to look at the boundaries to search for an optimal feasible solution. To 
this end, we split the problem into two cases. In the first case, we plug 01 % = 0 in (11) and then 
differentiate w.r.t. /%, which provides 


dW{a\\^ + Pid) 
dpi 


-<e*,y) + (ei,K(aP l - /3 V )) - b 2 {e i: p). 


Setting it to zero gives 


a t = 0 • 



Similarly, for the second case, plugging Pi = 0 in (11) and differentiating w.r.t. a* yields 


(16) 


dW{aS l + CHei,p\ l ) 
'don 


(e%, y) - (ei,K(aP l - P^ 1 )} - bi (e*, a) . 


Equating it to zero provides 




(17) 


Since bi,b 2 G (l,oo) are fixed constants for certain r, therefore, (16) and (17) solely depend on c t . 
In particular, if c* 7^ 0, then we show in the following theorem that either (16) or (17) gives the 
feasible optimal solution. 


Theorem 3. For i = {l,...,n}, let c, £ M and b\,b 2 £ (l,oo) be defined by (13). Then the 
following implications holds: 


i) 

ii) 
Hi) 


If Ci < 0, 
If ^ = 0, 

If Ci > 0, 


then (16) is the feasible solution. 

then (16) and (17) are the same feasible solution. 

then (17) is the feasible solution. 
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In particular, exactly one of the two cases produces a feasible solution (off, (if), and this is given 
by 

af = max (o, ^ , [if = max ^ 0 , ■ 

After finding the feasible optimal solution, the next task is to determine the coordinate i in which 
the update should be performed. Many approaches have been discussed so far for this purpose. A 
simple approach [10, p. 132-133] is to update for each coordinate i = 1,... ,n iteratively. Another 
method [40] is to choose the coordinate for update that violates the Karush-Kuhn-Tucker (KKT) 
conditions of optimality most. The latter approach is implemented in SVMs packages, S\M. li9ht [17] 
and LIBSVM [ 8 ]. Another idea, see [35], which is followed in this work, is to choose the coordinate 
i* whose update achieves the largest improvement for the value of dual objective function W. In 
other words, it performs the update in the direction 

i* G arg max W{a + 5ei, (3 + rjef — W (a, /3), (18) 

where S t = af — a* and ry = / if — fi t denote the difference between the new and the old values of 
a t and /% respectively. Based on this idea, we establish a rule in the following lemma to compute 
the improvement in the value of dual objective function W. 

Lemma 4. Let i G {1,..., n}, a, /3 € M n , and 6, rj £ R. Moreover let b\, 62 £ (1, 00 ) be defined by 
(13), then we have 

G(6 , 77 ) := W(a + Sej, /3 + r/e*) - W(a, f3) 

= s(vW ai (a,/ 3 )-^p) +77 +Sr h (19) 

With the above lemma, the Procedure 1 solves (18) to search the best direction. 


Procedure 1 Calculate i* £ arg mM, = i r .. n \ W{a + 6ei,0 + gef) — W(a,f3) 

bestgain < - 1 

for 7 = 1 to n do 

6i i- max (0, g-) - a t 

Vi max (°>-§) ~Pi 
gain <- G(5 l ,gf) 
if gain > bestgain then 
bestgain <r- gain 

7 i — 7 
6-i* t— Si 

77 i* Vi 

end if 

return i*,5i*,r]i* 

end for 


3.1 Stopping Criteria 

Solving problem (9) by some iteration method requires an appropriate stopping criteria. Several 
stopping criteria have been suggested so far for SVMs with offset. One method is to stop training 
when the KKT conditions are satisfied up to some predefined tolerance e > 0. Another method 
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is to use the duality gap as a stopping criteria [10, p. 109 and 128]. This method is also adopted 
by [35] to formulate a duality gap for SVM without offset. Following this idea, we define for dual 
variables a £ M+ and (3 £ R+ 

fa,13 ■= (a- 13, Kef) , (20) 

which gives ||/ ajy g||^ = (a — (3, K(a — j3)). As a result, the primal objective function (8) is 

^ n n 

P(fa,p 1 &,+,&,-) = x <« - P, K{a - 13)) + CtJ2 fl+ + c(l -t)J2 • 

i=1 i=1 

Following [35], the duality gap of -P(/ a ,/3j &,+ >&,-) and W(a,/3) is defined as 

S(a,f3) :=P(f atP ,-W(a,P), (21) 

which tells us to stop the iteration method of solving problem (9) if S(a,f3 ) < e, where e > 0 is 
some predefined tolerance. To efficiently compute S(a,f3), we split it into 

T(«, /3) = !<(« - ^), K(a - 0)) - W{a, (3 ), 

n n (22) 

E{a,f3) = rY J Cl+ + {l-r)Y J Cl-, 

i= 1 i=l 

and as a result we have S(a, (3) = T(a, (3) EC ■ E(a, {3). The value T(a, f3) can be obtained at each 
iteration by updating it in the chosen direction i, such as 

T(a + 5ei,/3 + ijei) = T(a,/3 ) - U(a i} Pi,6,rj ), 

where 

U(ai,Pi,6,r )) := 5 

+ 

Unlike T(a,/3), the value E(a,(3) can not be updated but needs to be computed from scratch at 
each iteration. To find an efficient formula, we first note that combining (8) with (20), we have 

Ci,+ = max {0, (y,ei) - (a - /5,iFe*)} = max jo, VW ai (a, (3) + j , 

and 

Ci - = max {0, (a - ft, Kef) - (y, e*}} = max jo, -VW ai (a, (3) - } ' 

With these formulas, the computation of E(a, (3) is an 0(n) operation. Let us now consider a little 
more involved stopping criteria based on [32, Chapter 7], that looks for an f a p £ H with 

M\fa,p\\ 2 H + Kl,d{ f a,fi) < vfm\\\f\\ 2 H + TZl,dU) + e, (24) 

where f a a is clipped at EM £ M. Formally speaking, the clipped value of f a $ : X —> M at 
EM £ M is defined by 

/ a„e = 


2 VW ai (a, f3) E (y,ei) E 


(a, ef) (bi E 1)5 


V ( VI Vp.(a,P) E (y, ef) E ^ ^ ) E2Srj. 


-M if fa,/3 < —M , 

fa,{3 if f a ,p€[-M,M] 

-M if f a ,/3 > —M . 



In other words, we restrict f a ^ to the interval [— M, M], which in turns, reduces the risk 77 l,£>(/)• 
However, clipping does not change the learning method since it is performed after the learning 
phase. Based on this idea, the clipped version of (20) after using (20) is 


/ a,pi x i) = (e.,y) - VfF Qi (a,/3) - _ M , 

which leads to the clipped £*,+ and £j._ as 

(a, ef) t M 


ti,+ = max 7 0, (y,ei) - (e*, y) - VW ai {a,ft) - 


2Ct J-m M 


£ i - = max < 0, 


<e i ,y)-VW ai (a,/3)- 


M 


(«, Ct) 

2CV J-M 


- <y, ei) ^ . 


We further define 


n ^2 

f 

2=1 


n ^2 
£ 

2=1 


£(«> Z 3 ) : = r £ *,+ + t 1 “ r ) £i, 

Then we see that (24) is satisfied if 


S(a, ft) := T(a, ft) + C ■ Efa ft) < —. 


(25) 


(26) 


(27) 


The clipped slack variables used in the stopping criteria (27) may provide a substantial decrease in 
duality gap in each iteration of learning algorithm compared to the unclipped slack variables used 
in (21), and hence the learning algorithm may require less number of iterations. [34] showed that 
the right hand side of the stopping criteria given in (21) should be replaced by A as in (27), where e 
has the same value for both. Furthermore, it is argued by [35] that unlike the duality gap stopping 
criteria for SVM with offset given by [10, p. 109f], both (21) and (27) are directly computable since 
they do not require the offset term. From this it is easy to derive an 0(n) procedure that updates 
VlF a (o!,/3), VWp (a, ft) and calculate S(a,ft). The pseudocode for this is presented in Procedure 

2. The one for S(a,ft ) is an obvious modifications and therefore omitted. 


Procedure 2 Update VW ai (a, ft) and VW^fa, ft) in direction i* and calculate S(a,ft) 

Tfaft) <r- T(a, (3) - Ufa, fti,6,r]) 

E(a, ft) e- 0 

for k = 1 to n do 

VW Qfc fa ft) <- VW ak fa ft)-{6- n)K ik - ^ s lk 
V% (a, ft) <- VW Ph fa ft) + (6- rj)Ki k - S ik 

fk,+ <- max{0, VW ak (a, (3) + 

<- max{0, ~VW ak faft) - } 

Efa ft) <- Efa ft) + fal+ + (1 - 
end for 

S(a, ft) = T(a, ft) + C ■ E(a, ft) 


With all the above computation, we now summarize the basic idea of the 1D-SVM in Algorithm 
1. This tells us to look repeatedly for the best direction i* and performs update in that direction 
until the predefined stopping criteria is satisfied. 


9 




Algorithm 1 1D-SVM solver 

initialize a, /3,X7W a (a, /3),VWp(a, /3) and T(a,f3) 
while S(a,(3) > ^ do 

(**,5,;.,?^.) Procedure 1 
on* e- a.j* + Si * 

Pi* Pi* + Vi* 

use Procedure 2 to updateVW a (a,/?), VVP / g(a,/3) in direction i* by A* and r?,. and calculate 
S(a,P) 

end while 


A closer look of the Algorithm 1 reveals that there is still need to develop some procedures to ini¬ 
tialize a and /3 , and the corresponding gradients. The following section presents some initialization 
methods to fulfill this requirement. 

3.2 Initialization 

Various approaches are available to initialize a and P and their corresponding gradients. We 
here briefly describe two approaches, namely, cold start and warm start that will be used in the 
implementation of the solver. 

10 & WO: Cold Start With Zeros. This is the most simplest initialization in which we take a <— 0 
and (3 0 to initialize. After a simple calculation, it is not hard to initialize the corresponding 

gradients and the duality gap. 

Wl: Warm Start by Recycling Old Solution. Recall that typically the hyper-parameter A is 
chosen by a search over a grid A = {Ai,..., A m } of candidates values. If these values are ordered 
in the form Ai > ... > X m and the SVM is trained in this order, then the resulting C ^\..., C 
satisfy the property that for all j = 1 ,... , m— 1. For C^ 1 ) we initialize the solver with 

the above cold start and for j > 2, we initialize it with a warm start a <— a* and fd •(— ft* where 
a*, ft* is the approximate solution obtained by training with C old = 1 . Obviously, in this case, 

we can also recycle parts of V a (a, ft), Vp(a,/3) and S(a,ft ) such as described in the Procedure 3. 


Procedure 3 Initialize by a e- a*, ft e- ft*, compute gradients and dual gap 
E{a,ft) e- 0 

for * = 1 to n do 

OL\ A — OL* 

A <— A* 

X7 ai (u,0) <5— W ai {.OL* , ft*) + 2 ^ (|p5Id — fJMw) 

V/3;( a ; A Vfii(a*, ft*) + 2(1 -t) (csra — czsf) 

«- max (0, V ai (a, (3) + 2T gLw ) 
max (0,—V ai (a,/3) - 2r gL w ) 

E(a, /3) <- E(a, ft) + (tA ? + + (1 - t)£?_) 

end for 

T{a,/3) <- T(a,P) - \ (<4,,, - ^) E"=i ("; + 

S(a, p) <- T(a, ft) + C new E(a, ft) 
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4 Working Set of Size Two 


The Algorithm 1 performs an update for one coordinate per iteration. In this section, we extend 
this idea and develop an algorithm to perform an update for two coordinates per iteration. For 
this, we first solve the 2D- problem exactly in the following section. Then we will describe a low 
cost working set selection strategy based on the 1D-SVM solver. In the end, we establish a stopping 
criteria for the 2D-problem. 


4.1 Exact Solution of Two Dimensional Problem 


Let us fix two coordinates i , j £ {1,..., n} with i ft j. We further assume that e* and ey are the i-th 
and j- th vectors of standard basis of M n , and write aA*’- 7 := a— a^i — a 3 e 3 and ft*’- 7 := f3—/3iei—/3jej. 
By this and using Aft = Aft = 1 for normalized kernels, the dual objective function for 2D-problem 
is 

W := W(a ^’ 3 + Ojej + ayey, ft*’- 7 ’ + fte* + ftey) 

= W>\*ft ft*’ft + W(a u Pi) + W(a j ,P j ) - (a* - ft)(ay - ft) Aft , 


(28) 


where 


W(cti,Pi) := {ai - Pi)(ei, y) - (a* - ft)(ej, Afta^*’- 7 - ft*’- 7 )) - ^(cc* - ft) 


1 


4Ct(1 — r) 


((1 - r)af + rft 2 ), 


W(aj,Pj) := (ay - Pj)(ej, y) - (ay - P j )(e j , K(a S ' l,J - ft lJ )) - -(ay - ftft 


-((1 - r)a 2 + rft 2 ). 


4CV(1 — r) 

Taking partial derivatives of (28) w.r.t. a,;, a j, ft and ft, we obtain the gradients 

VWft = ft, y) - ft, Afta^’ - ftft) - biou + ft - (ay - ft)A*,y , 
VWft = y) + ( ei , K(a^ 3 - ftft) + a* - ftft + (ay - ft)Aft , 
VILft = ft, y) - (ey, A(a VJ - ftft) - 6 , ay + ft - (a* - ft)Aft , 
VWft = -(ey, y) + (ey, AT(a^’ - P \ft) + a 3 - b 2 pj + (a* - ftftft , 


(29) 


where ft, ft are defined in (13). By setting partial derivatives (29) to zero, we obtain the following 
system of equations 

bicti - ft + kaj - kPj = Ci , 

OLi - b 2 Pi + ka 3 - kPj = Ci , 
kat - kPi + ft ay - Pj = Cj , 
kat - kPi + ay - ftft = cy , 


(30) 


where 


k := K, 


y , 


°i := (ei,y )-(e i ,K(a\ i ’ 3 -p\ i ’ 3 )), 

= VWft(a,ft +b\(a,ei) - (ft e,) + (a - fteyft, 
Ci ■■= ft,y)-ft,lftaV’ 7 -ftft) 

= VW aj (a,j3) + ft(a,ey) - (ft ey) + (a - P,ei)k. 
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Let a* , a*, [3* and p* be the solution of (30). Then solving (30) by matrix operations leads to the 
following global solution 

\M\ a* = ( b 2 - 1)(6i&2 - 1 )ci + (1 - 6 2 )(&i + b 2 - 2)kcj , 

\M\ p* = (1 - 6i)(6i& 2 - 1 )ci + (6i - l)(6i + b 2 - 2)k Cj , 

|M| a* = (b 2 - l)(b\b 2 - 1 )cj + (1 - 6 2 )(6i + b 2 - 2)fec*, 

|M| p* = (1 - 6 1 )(6 1 6 2 - l) Cj + (6i - l)(6i + 6 2 - 2)fe Ci . 

Here 

\M\ ■= b\(b 2 2 - k 2 ) - 2b 1 (b 2 k 2 + 62 - 2k 2 ) - (b 2 - 2 fk 2 + 1, 


is always positive. This is shown in the following lemma 
Lemma 5. For 61,62 £ [l,oo) and |fc| < 1, we have |M| > 0. 


Note that, in the case of c t = cj = 0, we have a* = PI = a* = P* = 0. On the other hand, if 
Cj 7 ^ 0 or Cj 7 ^ 0, then (31) together with Lemma 5 leads, after some calculations, to the following 
equations 


-1 h'l 1 

I — T 


-Pi 


a A = —; 


1 — T 


Pi 


Since r £ (0,1), the global solution (31) thus violates the constraints of (9) iff a / 0 or Cj / 0, 
that is, the solution is not feasible. To obtain the feasible solution, we know by the Theorem 2 
that we need to look at the boundaries of the feasible region. In our case, this means that we need 
to set some of the dual variables to zero. Note that this is a simple extension of the idea that is 
presented in ID-problem. Let us begin by setting one dual variable to zero, say 07 = 0. Computing 
the gradients with the remaining variables, we get the last three expressions of (29) where we set 
cti = 0. After setting the gradients to zero, we obtain the system of equations 


-b 2 Pi + kaj - kPj = Ci , 

—kPi + 6 iaj - Pj = Cj , (32) 

—kPi + ctj - b 2 Pj = Cj , 


where k,Ci,Cj,bi and b 2 are the same as in (30). Let us write a^,Pf and P^ be the solution of 
(32). Then, by subtracting the last two equations of (32), we obtain 

a t = -r- T F’ < 33 > 

and hence this solution is again not feasible. In a similar way, setting Pi = 0 provides the following 
system of equations 

b\cti + kaj — kPj = Ci , 
kai + 61 a j — Pj = Cj , 
kai + a 3 - b 2 Pj = Cj , 

which again leads to (33) and thus the same conclusion. The remaining two cases where aj = 0 and 
Pj = 0 can be treated analogously. After this, we now consider the situation where two variables 
are set to zero. For this, we split the problem into six subcases. Let us consider the first subcase 
where we set 07 = 0 and Pi = 0 in (28). Taking derivatives w.r.t. aj and Pj provides 

VW a , (a\\ p\*) = ( ej , y) - (, ej,K(a W - /3^')> + & - ha 3 , 

VW^a^) = ~(ej,y) + ( ej ,K(a ^ - P^ j )) + a, - b 2 p 3 . 


12 



Setting (34) to zero, we obtain the system of equations 


bictj - /3j = Cj , 
a j - b 2 f3j = Cj . 


Let at and /?t be the solution of (35). Then subtracting equations of (35) leads to 


a • 


T 

1 — T 



(35) 


(36) 


which shows that the solution is not feasible. Analogously, the second subcase where aj = 0 and 
/ 3j = 0 leads to the same conclusion. In the third subcase, we set a% = 0 and a 3 = 0 in (28) and 
differentiate w.r.t. (3 t and / 3j which gives 

= -(e*, y) + (e h K(a W - (3^)) - - b 2 (3 i , 

VW>, («H 0) = ~{ ej , y) + (ej, K(a^ - 0^)) - 0^ - b 2 0j . 

Setting (37) to zero, we obtain a system of equations which, after some calculations, provides the 
solution 


af = 0 , atj- = 0 , 3~ - |/i|| 1 (kcj - b 2 a) , /3+ = \Bi\ 1 (ka - & 2 Cj), (38) 

where |-Bi j := — t 2 > 0. Considering the forth subcase, we set 0i = 0 and 0j = 0. Analogous to 

third subcase, the gradients are 

VW ai (a,/3\^) = (ej,y) - (e*, tf(a^' - 0^)) - a 3 K- b lUi , 

V^(«,^ j ) = (e j5 y) - (ej,K(a\^ - 0^)) - a^j - b x a 3 , 

which leads to the solution 

fit = 0 , /?/ = 0 , af = \B 2 \ 1 (&iCj - kc 3 ) , at = \B 2 \ 1 ( hcj - kct) , (39) 

where |1 := b\ — k 2 > 0. For fifth subcase, we set a 3 = 0 and [3j = 0 and obtain the following 

solution 


a* = 0, £/ = 0, Pt = 1 ^ 3 1 1 (feiCj - kcj ), at = |-B 3 | 1 (kc i -b 2 c j ), (40) 

where |A? 3 1 := k 2 — b\b 2 < 0. Finally, for the last subcase where aj = 0 and 0i = 0, the solution 

can be obtained by interchanging i with j in the solution of fifth subcase, which is 

Pt = 0, a+ = 0, otf = |-B 3 | 1 (kcj — b 2 Ci ), /0t = |S 3 | 1 (6iCj - kci ). ( 41 ) 

It is interesting to note that the solutions (38), (39), (40) and (41) have the following common 
expressions 

T\ := kcj - b- 2 Ci , (42) 

T-2 := kci — b 2 Cj , (43) 

T 3 := bici - kcj , (44) 

T 4 := 61 Cj — fee*. (45) 


The following lemma investigates the behavior of the above four expressions. 
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Lemma 6. Assume that Ci / 0 or Cj / 0. Then the following implications hold: 

i) IfT\ > 0 and T2 > 0 then we have Ci < 0 and Cj < 0. 

ii) If T3 > 0 and T4 > 0 then we have Ci > 0 and cj > 0. 

In particular, the expressions Ti,T 2 ,T 3 and T4 are not simultaneously positive or negative. 

Using Lemma 6, the following theorem shows that only one case from (38), (39), (40) and (41) 
provides the feasible optimal solution. 

Theorem 7. Assume that q / 0 or Cj / 0, then exactly one of the four cases (38), (39), (40) and 
(41) produces a feasible solution. Moreover, the following implications hold: 

i) If T\ > 0 and T 2 > 0, then (38) is the feasible solution. 

ii) If T -3 > 0 and T4 > 0, then (39) is the feasible solution. 

Hi) If T\ < 0 and T3 < 0, then (40) is the feasible solution. 

iv) If T 2 < 0 and T4 < 0, then (41) is the feasible solution. 

Theorem 7 also suggests to impose if conditions based on expressions (42), (43), (44) and (45) 
in the implementation of the algorithm for 2D-SVM solver. This helps to reach directly to the 
feasible optimum solution. 

4.2 Working Set Selection Strategies 

In this section, we address the question how to choose the directions i* and j* in which the 2D- 
SVM solver performs an update. Several possibilities are available for this task. A straightforward 
approach is to consider all pairs of directions ( i,j ) and choose the one for which the 2D-gain of 
W is maximum. Note that the 2D-gain is simply an extension of the idea presented in Lemma 4. 
Formally, for a, f3 £ R n and 5, g £ R, the 2D-gain is 

W(a + dei + Sej,/3 + g e, + gej) - W(a,/3) = G(5i,ru) + G(Sj,gj) - (Si - r p)(5j - 9j)I<i,j , (46) 

where G(Sk,gk ) for k = i,j is the lD-gain defined in Lemma 4. 

It is worth noting that the above described working set selection strategy is not a good choice 
because the search is 0(n 2 ). However it may be viewed as an ’’optimal” two dimensional strategy 
and served as a baseline to all other subset selection strategy that can be interpreted as the low cost 
approximations to this approach. In the following, we describe two low cost working set selection 
strategies that we consider in this work. 

1/I/S5 1: Two lD-direction With Maximal Gain From Separate Subsets. A simple way to preserve 
the low cost search from lD-solver is to split the index set {1 ,... ,n} into two parts {1,..., |} and 
{| +1 ,... ,n} and search for the ID directions with maximum gain over these two parts separately. 
In other words, we can choose the directions i* and j* by 

i* £ arg maxIU(a + Sei, (3 + gef) — W(a , (3 ), 

‘ £ " /2 (47) 

j* £ arg maxIU(a + Sei, (3 + VG) ~ kF (a, (3 ), 

i>n /2 

where S and g are defined in 1D-SVM solver. These chosen directions are used for the first iteration. 
For the subsequent iterations, we search for the new ID directions, i* ew and jf ew , again by using 
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(47). Then we compute the 2D-gain of W for all pairs of old and new directions of previous and 
current iterations respectively and look for a pair for which this gain is maximum. 

1/1/5S 2: lD-direction With Maximal Gain And A Direction Of A Nearby Sample. This is simply 
an extension of 1/i/SS 1. After determining ( i*,j *) by l/l/SS 1, we fix i* and then search for another 
direction j* from fc-nearest neighbors of x t * with respect to the metric 

d(x, x') := ||x — x'\\ 2 . 


4.3 Stopping Criteria 

To formulate the stopping criteria for 2D-problem, we follow the idea that is presented in Section 
3.1. Let us first consider the component T(a,f3 ) of (22) and by using (46), we find the following 
update of T(a,/3) in the directions of i and j 

T(a + dei + 5ej,/3 + ?ye; + rye^) = T(a,/3 ) - t/(a;,/%, <5*, ry.;) - U(aj, /3j,5j,r)j) 

+ 2(<5j — rji)(6j — r]j)Kij , 

where U(ctk, Pk, bk,Vk) for k = i,j is defined in (23). To compute E(a,/3), we first obtain the 
updated gradients in the directions of i and j, and then subsequently compute £/ j+ , Moreover, 

E(a,/3 ) can also be computed for the 2D-problem similar to ID-problem by using (26). With all 
above computations, we now summarize the idea of 2D-SVM solver in Algorithm 2. 


Algorithm 2 2D-SVM Solver 

initialize a, /?, VW a (a, /?), VW^(a, (3) and T(a,f3) 
while S(a,j) > ^ do 

select directions i* and j* 

use procedure 5 to obtain the optimum solution for direction i* and j* 
update a and (3 in the direction i* and j* 

update VW a (a, [3), NWp{ot,(3) in the directions ( i*,j *) and calculate S(a,(3) 

end while 


5 Experiments 

To evaluate the performance of the proposed solver for expectile regression, we perform several 
experiments to address the following questions: 

1. Which subset selection strategy leads to the smallest number of iterations or shortest run 
time? 

2. What is the number of nearest neighbors that leads to the smallest number of iterations and 
shortest run time? 

3. Is there advantage of warm start initialization when the parameter search is performed over 
a grid? 

4. Does the clipping provide a significant reduction in the training time and iterations? 

5. How well does the 2D-SVM-solver work as compared to ER-Boost that is proposed by [44]? 


15 




To answer these questions, we implemented the 2D-SVM-solver in C ++ . The algorithm was 
compiled by LINUX’s gcc version 4.7.2 with various software and hardware optimization enabled. 
All experiments were conducted on a computer with INTEL CORE i7 950 (3.07 GHz) and 8GB 
RAM under 64bit version of Debian Linux 7.8 (Debian 3.2.0-4-amd64). During all experiments 
that incorporated measurement of run time, one core was used solely for the experiments, and the 
number of other processes running on the system were minimized. 

In order to perform the experiments, we have considered nine data sets that were downloaded 
from different sources. These data sets comprises various number of features and vary in sample 
sizes from 630 to 20639. The data sets concrete-comp, updrs-motor, cycle-pp, airfoil-noise 
and HOUR were downloaded from UCI repository. The two data sets NC-CRIME and head-circum 
are available and documented in R packages Ecdat and AGD respectively. The remaining two data 
sets cal-housing and munich-rent were downloaded from StatLib and the data archive of the 
Institute of Statistics, Ludwig-Maximilians-University of Munich respectively. We scaled the data 
sets componentwise such that all the samples including labels lie in [—1, l] <i+1 , where d is the 
dimension of the input data. In addition to that, we generated a random split for all data sets that 
contained approximately 70% training and 30% test samples. Table 1 describes the characteristics 
of the considered data sets. 

In all our experiments with the SVM solver, we used Gaussian kernels (4). To determine the 
hyper-parameters, we have considered a geometrically spaced 10 by 10 grid for A and 7 over the 
interval [cin -1 ,1] and [c2n _1 / rf , C 3 ] respectively, where n is the number of training samples, d is the 
input dimension, and c 1 := 0.001, C2 := 0.1 and C3 := 0.2. Here, the values of the constants were 
chosen with the help of our experience with least square SVMS [11]. To choose the best values of 
these hyper-parameters, we used /c-fold cross validation with randomly generated folds. In our case, 
we have considered k = 5. During the A:-fold cross validation, the hyper-parameter 7 was internally 
converted to 7 := Usujdin. anc [ A to C := 2 {k-i)n\ ’ w h ere (& — 1 )n/k is approximate actual training 
set size for fc-fold cross validation. 


data 

sample sizes 

training size 

test size 

dimension 

NC-CRIME 

630 

441 

189 

19 

CONCRETE-COMP 

1030 

721 

309 

8 

AIRFOIL-NOISE 

1503 

1052 

451 

5 

MUNICH-RENT 

2053 

1437 

616 

12 

UPDRS-MOTOR 

5875 

4112 

1763 

19 

HEAD-CIRCUM 

7020 

4914 

2106 

4 

CYCLE-PP 

9568 

6697 

2871 

5 

HOUR 

17379 

12165 

5214 

12 

CAL-HOUSING 

20639 

14447 

6192 

8 


Table 1: Characteristics of data sets together with the training sizes and the test sizes that refer 
to the splits used in the run time experiments. 

Let us now explore the answers of the above stated questions one by one. To address the first 
question, we performed experiments with warm start initialization method and clipped duality gap. 
In addition, we have considered N = 15 nearest neighbors for WSS 2. The results are presented 
in Figure 2 and 3, which depict that WSS 2 needs substantially less iterations as well as training 
time than WSS 1 on all data sets. For larger data sets such as updrs-motor, head-CIRCUM, 
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CYCLE-PP, HOURS and CAL-housing, the run time and iterations with WSS 2 is at least 50% less 
than WSS 1. Moreover, a closer analysis, see Figure 4 and 5 shows that the savings are obtained 
at the hyper-parameters pairs for which training is particularly expensive, that is, for small A and 
medium to small 7. 

We have fixed N = 15 for WSS 2 so far to address the previous question. To investigate how the 
computational requirements change with the number of nearest neighbors, we performed the exper¬ 
iments for IV-nearest neighbors, where N = 5,10,15, 20, 25, 30, 35,40 for each r = 0.25, 0.50,0.75. 
Again we used warm start initialization and clipped duality gap. Here, it was observed that the 
number of iterations tends to decrease with increasing N. However, for N > 25, only a slight im¬ 
provement in the number of iterations was found whereas the required run time tended to increase 
compared to smaller N. We therefore plotted the results for N = 5,10,15, 20 only. Figure 6 shows 
that the solver attains the minimum training time for N = 15 on almost all data sets. Moreover, 
Figure 7 shows that the number of iterations decreases with increasing N. However, this decrease 
becomes negligible when N > 15. All this together leads us to conclude that N = 15 is the best 
choice for our er-svm solver. Finally, Figure 8 and 9 illustrate the computational requirements for 
different hyper-parameters pairs. Again the largest savings for N = 15 were obtained for small A. 

To answer the third question regarding the initialization methods, we trained with N = 15 and 
clipped duality gap. The results, which are presented in Figure 10 and 11 show that using the 
warm start initialization saves between 20% and 40% of both training time and iterations. The 
detailed behavior for different hyper-parameter pairs is illustrated in Figure 12 and 13. Again the 
savings are more pronounced for smaller A. 

To answer the forth question, we considered stopping criteria with clipped duality gap and with 
unclipped duality gap. Here, we used the warm start initialization option and WSS 2 with N = 15 
nearest neighbors. The corresponding results are shown in Figures 14 and 15. In the case of hinge 
loss function, [35] showed that using the clipped duality gap yields significant reduction, both in 
run times and iterations. However, in our case, we get only a small reduction in iterations, that is, 
1% on almost all data sets. On the other hand, this stopping criteria causes 2% to 17% increase 
in run times on different data sets. This indicates that the unclipped duality gap is the better 
choice in our case. The per grid plot of hyper-parameters for data set CAL-HOUSING, as presented 
in Figure 16 and 17, shows that clipping reduces the run time only for few pairs of hyper-paranreter 
when A is small and 7 is large. For rest of the pairs, unclipped duality gap leads to smaller run 
time. 

Finally, we compare our SVM solver with ER-Boost on the basis of test error and training 
time. For this, we considered our 2D-SVM solver with unclipped duality gap (er-svm), our 2D- 
SVM solver with clipped duality gap (er-SVM*) and ER-Boost [44], Since the experiments using 
large data sets entail long run times, we splitted the data sets into three categories, namely, small 
(n < 5000), medium (5000 < n < 10000) and large (n > 10000). We then conducted experiments 
for ER-SVM, ER-SVM* and ER-Boost by repeating 5-fold cross validation 25, 10 and 5 times for 
the small, medium and large data sets respectively. For the 2D-SVM solvers, we used the 10 by 
10 default grid of hyper-parameters described above. For ER-Boost, we used the default value of 
boosting steps (M = 100) and performed 5 fold cross validation to choose the best value of the 
interaction level (L) between variables, as by the ER-Boost manual. The resulting, average test 
error (standard deviation) and training time are shown in Table 2 and Table 3 respectively. It 
turns out that both SVMs solvers have better test performance than ER-Boost on all data sets, 
but all reported errors are relatively small. Examining the achieved training times for each data 
set, we observe that SVM solvers are more sensitive to the training set size and less sensitive to 
the dimensions of data set, whereas, ER-Boost behaves the other way around. In addition to that, 
the test performance of ER-SVM* is slightly better than er-SVM at the cost of almost 10% longer 
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age (in years) 


sqrt(age in years) 


Figure 1: Estimated expectiles for r = 0.01,0.03,0.05,0.10,0.30,0.50,0.70,0.90,0.95,0.97,0.99 for height 
against age of head-CIRCUM. The graphs comprises expectile curves for original data set (left) 
and data set with transformed age. 


training times. 

In the end, Figure 1 presents the expectile curves for different t considering height against age 
from data set head-CIRCUM. On the left we see some crossing and wiggling problems. Following 
[24], the use of square root transformation on age resolves these issues as the right figure shows. 


18 





DATA 

r = 0.25 

t = 0.50 

r = 0.75 | 

ER-SVM 

ER-SVM* 

ER-Boost 

ER-SVM 

ER-SVM* 

ER-Boost 

ER-SVM 

ER-SVM* 

ER-Boost 

NC-CRIME 

0.00616 

(0.00182) 

0.00555 

(0.00169) 

0.00948 

(0.00177) 

0.00669 

(0.00194) 

0.00605 

(0.00161) 

0.01367 

(0.00305) 

0.00536 

(0.00172) 

0.00509 

(0.00157) 

0.01459 

(0.00405) 

CONCRETE-COMP 

0.00901 

(0.00130) 

0.00893 

(0.00128) 

0.03961 

(0.00365) 

0.01021 

(0.00122) 

0.01013 

(0.00117) 

0.05038 

(0.00417) 

0.00889 

(0.00112) 

0.00879 

(0.00101) 

0.04556 

(0.00339) 

AIRFOIL-NOISE 

0.00814 

(0.00121) 

0.00806 

(0.00119) 

0.04223 

(0.00211) 

0.00947 

(0.00134) 

0.00939 

(0.00115) 

0.04817 

(0.00256) 

0.00855 

(0.00092) 

0.00850 

(0.00087) 

0.03832 

(0.00218) 

MUNICH-RENT 

0.00131 

(0.00033) 

0.00126 

(0.00030) 

0.01569 

(0.00087) 

0.00122 

(0.00029) 

0.00121 

(0.00029) 

0.01812 

(0.00113) 

0.00101 

(0.00018) 

0.00101 

(0.00016) 

0.01598 

(0.00103) 

UPDRS-MOTOR 

0.02518 

(0.00152) 

0.02502 

(0.00152) 

0.05345 

(0.00069) 

0.02844 

(0.00159) 

0.02828 

(0.00152) 

0.06257 

(0.001496) 

0.02585 

(0.00166) 

0.02569 

(0.00169) 

0.015229 

(0.001787) 

HEAD-CIRCUM 

0.00323 

(0.00008) 

0.00323 

(0.00008) 

0.02419 

(0.00047) 

0.00390 

(0.00011) 

0.00390 

(0.00011) 

0.02482 

(0.00057) 

0.00333 

(0.00009) 

0.00333 

(0.00096) 

0.01855 

(0.00045) 

CYCLE-PP 

0.00420 

(0.00009) 

0.00421 

(0.00011) 

0.03588 

(0.00079) 

0.00516 

(0.000197) 

0.00516 

(0.00019) 

0.04536 

(0.00097) 

0.00479 

(0.00027) 

0.00477 

(0.000289) 

0.03930 

(0.00076) 

HOUR 

0.01575 

(0.00029) 

0.01543 

(0.00034) 

0.02888 

(0.00077) 

0.01664 

(0.00046) 

0.01627 

(0.00043) 

0.04021 

(0.00110) 

0.01285 

(0.00031) 

0.01259 

(0.00035) 

0.03821 

(0.00103) 

CAL-HOUSING 

0.02426 

(0.00126) 

0.02415 

(0.00117) 

0.05406 

(0.00135) 

0.02546 

(0.00123) 

0.02518 

(0.00119) 

0.07473 

(0.00158) 

0.01919 

(0.00071) 

0.01912 

(0.00064) 

0.07337 

(0.00144) 


Table 2; Average test error (standard deviation) for 2D-SVM with unclipped duality gap stopping criteria (ER-SVM), 2D-SVM with clipped duality gap stopping criteria 
(ER-SVM*) and ER-Boost. The average test error (standard deviation) was computed on 25 random splits for small data sets, 10 random splits for medium size 
data sets and 5 random split for larger size data sets. 


DATA 

t = 0.25 

t = 0.50 

r = 0.75 

ER-SVM 

ER-SVM* 

ER-Boost 

ER-SVM 

ER-SVM* 

ER-Boost 

ER-SVM 

ER-SVM* 

ER-Boost 

NC-CRIME 

0.305 

0.317 

20.954 

0.323 

0.318 

21.545 

0.298 

0.311 

21.595 

CONCRETE-COMP 

0.983 

1.028 

1.861 

1.027 

1.089 

1.899 

0.964 

1.018 

1.8025 

AIRFOIL-NOISE 

2.078 

2.173 

0.645 

2.234 

2.342 

0.656 

2.122 

2.232 

0.649 

MUNICH-RENT 

2.413 

2.485 

9.288 

2.385 

2.476 

9.542 

2.364 

2.426 

9.460 

UPDRS-MOTOR 

43.874 

46.737 

110.853 

47.819 

47.819 

114.967 

42.614 

45.537 

114.335 

HEAD-CIRCUM 

34.352 

36.173 

1.7826 

36.928 

39.029 

1.7529 

36.744 

37.256 

1.796 

CYCLE-PP 

83.452 

85.893 

2.7473 

91.127 

93.897 

2.758 

85.690 

87.309 

2.714 

HOUR 

307.249 

318.357 

70.376 

315.692 

327.897 

69.576 

281.972 

288.479 

68.536 

CAL-HOUSING 

506.679 

529.945 

39.913 

535.835 

550.364 

39.974 

458.223 

479.735 

38.880 


Table 3: 


Training time (in seconds) for 2D-SVM with unclipped duality gap stopping criteria (ER-SVM), 2D-SVM with clipped duality gap stopping criteria (ER-SVM*) 
and ER-Boost. 





A Proofs 


The proofs of Lemma 4 and Lemma 5 are trivial and therefore omitted. The rest of the proofs are 
given below. 

Proof of Theorem 1. The first convergence follows from [32, Theorem 9.1] and the second 
convergence is a consequence of the first convergence and [32, Corollary 3.62], where we note that 
we do not need the completeness of X since we already know the existence and uniqueness of 

ftp • □ 

Proof of Theorem 2. i) We first show that W has a global maximum at a*. To do this, we 
proceed by contradiction, that is, we assume that there exists an a G M m with 


W(a*) < W(a ). (48) 

By concavity of W, we conclude that for t G [0,1] 

W({ 1 - t)a* + to) > (1 - t)W{a*) + tW(a ). (49) 

On the other hand, h := t(a — a*) G M m and Taylor’s theorem in the multiple dimensional version 
yields 

W((l-t)a*+ta) = W(a*+ h), 

= W(a*) + {W'(a*), h) + l -(h , W"(a*)h) + 0(|| h 2 ||), 

= W(a*) + ^(a- a *, W"(a*){ot - a*)) + 0{t 2 ). 

Using this in (49) we obtain 

+2 

W(a*) + t(W(a) - W(a*)) < W(a*) + -{a-a*, W"(a*)(a - a*)) + 0(t 2 ), (50) 

and thus 

c\t < + 0{t 2 ), 

where c\ := W{a) — W(a*) and C2 := {a — a*,W"(a*){a — a*)). Furthermore, we have C2 < 0 
since W is concave and c\ > 0 by (48). For sufficiently small t > 0, (50) is therefore impossible 
and hence (48) can not be true. Let us now show that W has no other global maximum. To show 
this, we assume the converse, that is, W has a global maximum at some a** ^ a*. Then we obtain 
= 0 by usual calculus, and hence our assumptions are violated. Consequently, W has its 
only global maximum at a*. 

ii) If a* £ A then we also have a* ^ A. where A denotes the interior of A, and for a G A we 
thus have a / a*. Let us now show that for all a G A there exists an a* G <9.4 with 


W(a*) > W(a). 


To this end, we fix an a G A and consider the function 


Furthermore, we set 


7 : [ 0 , 1 ] -> R m 

£<-)•( 1 — t)a* + ta . 

h := W o 7 . 


(51) 
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Then it is easy to see that h is concave. Moreover, since a / cO, we find 7 (t) / a* for all t £ (0,1] 
and thus h(t) < h( 0) for all t £ (0,1]. By the concavity of h we conclude that h is strictly decreasing. 
We now show that there exists t* £ (0,1] with 7(C) £ dA. Let us assume the converse, that is, 
T U dA = 0, where T := y([0,1]). Considering the partition A, dA, M m \M, where A denotes the 
closure of A, we then find by the assumed A = A and T U dA = 0 that 

Bi := TnA 

b 2 -.= r n M m \M = r n (R m \A), 

is a partition of T. Since a £ A and a* £ A, we further find B\ 7^ 0 and B 2 / 0. Moreover, since 
M m \M is open, the sets B 1 and B 2 are relatively open in T and T. However, the continuous image 
of a connected set, is connected and thus T is connected. This leads to a contradiction, and hence 
there exists a t* £ [0,1] with 7 (t*) £ dA. Clearly, we have t* < 1 since a ^ dA. For a * := 7(C), 
the already established strict monotonicity of h then shows 

W(a*) = h(t*) > h{ 1) = W(a). 

Consequently we have shown (51) and thus 

sup W(a) = sup IT (a). 
a£dA oeA 

In other words, it suffices to show that the supremum over dA is attained at some a* £ dA. To 
this end, we first show that {W > p} is bounded for all p < W* := W(a*). For a £ 5, where 
S C M m denotes the Euclidean unit sphere, we define 

h a : [0,00) —> M m 

t t-A- W(a* + ta ). 

Then h a is concave and continuously differentiable, and has a global maximum at t = 0. 
Moreover, h a is strictly decreasing with lim h a (t) = —00. We define 

t—> OO 


t a := max{f > 0 : h a (t) > p}, 


where we note that the maximum is indeed attained by the continuity of h a and t a < 00 . Our 
next intermediate goal is to show that a 1 —> t a is continuous. To this end, we fix an ao £ S , and an 
£ > 0 with y/e < —h' a (t Q 0 ), where we note that h' a (t a 0 ) < 0 since h ao is strictly decreasing and 
W* > p. Since W is continuous differentiable, then there exist a 6 > 0 such that for all a £ S with 
11ao — a ll < $ we have 

Ko(^o) - h' a (ta o)| < e. 

For t a > t Qo , the concavity, or more precisely, the subdifferential inequality of —h' a (t a 0 ), then gives 


ha(ta) ^ h a (t a 0) + h a (t ao )(t a tao) i 

< h aQ (t ao ) + £ + (h Qo (t a 0) + s)(t a — t ao ) , 

— hagitag) + £ + —h a g(t ao )(t a — t ao ) ■ 


Now recall that h a (t a ) = p = h ao (t ao ). Thus we obtain 


0<£+ ah'an(ta 0 ){ta ta 0 ) j 


21 



and since h! a (t ao ) < 0, we conclude that 


-2e 


b'ao i^ao) 


> t n ~ t, 


ao > 


and thus 


t a < t 


ao 


+ 



< 2y/e. 


Since an analogous bound can be established in the case t < 

continuous. Consequently, there exist an ao £ S with t Qo = sup 

aeS 


Now we show that there exist a* £ A with 


t ao , we conclude that a t a is 
t a , and thus {W > p} is bounded. 


W* := sup W(a) = W(a *). 

aeA 

Clearly there is an (a n ) C A with 

W(a n ) ->■ W * , 

and since {W > p] is bounded, the sequence a n is also bounded. Then there is a subsequence a nk 
and an a* with a Uk —> a* and the continuity of W yields W(a nk ) — > W{a*). Consequently, we 
have shown W(a*) = W(a). Finally a* = lirn a nk £ A follows from A = A. □ 


Proof of Theorem 3. If c t = 0, there is nothing to prove. Let us assume that Ci > 0. Since 
b\ , b '2 £ (l,oo), then only (17) provides a feasible solution c* > 0 because (if < 0 in (16). Similarly, 
if we assume that c* < 0, then af < 0 in (17) while > 0 in (16) which makes it feasible 
solution. We finally conclude that only one of two cases provides the feasible optimal solution 
when Ci A 0. □ 

Proof of Lemma 6. i) Since T\ > 0 and T -2 > 0, we have ^-Cj < Cj < Since we assumed 

that Ci A 0 or Cj / 0, we conclude from the latter and b- 2 , k > 0 that we actually have c* / 0 and 
cj A 0- Moreover, b -2 > 1 and fc < 1 shows that c$ < 0 and Cj < 0. 


ii) Since T3 > 0 and X4 > 0, we have < Cj < ^-c*. Since we assumed that c* / 0 or Cj / 0, 
we conclude from the latter and b\,k > 0 that we actually have c* / 0 and Cj / 0. Moreover, 
61 > 1 and k < 1 shows that c* > 0 and Cj > 0. 

Finally, this leads to conclude that T±, T 2 , T3 and T4 are not simultaneously positive. By similar 
arguments, it can be shown that these expressions are not simultaneously negative. □ 


Proof of Theorem 7. Our first goal is to show that at most one of the four cases (38), (39), 
(40) and (41) leads to a feasible solution. To this end we note that (38) is feasible if and only if 
T\ and T 2 are non-negative. Similar consideration from (39) to (41) leads to the Table 4. Now let 


Optimal Solution 

T\ 

T 2 

Ts 

T a 

(38) feasible 

> 0 

> 0 

- 

- 

(39) feasible 

- 

- 

> 0 

> 0 

(40) feasible 

- 

< 0 

< 0 

- 

(41) feasible 

< 0 

- 

- 

< 0 


Table 4: Behavior of expressions Ti, T 2 , T3 and T4 when any optimal solutions is feasible 
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assume that (38) is feasible. By Lemma 6 we then see that (39) is not feasible. Moreover, if (40) 
was feasible, we would have 

kci — b-2Cj , 

which implies Ci = Cj = 0 by k < 1 and b 2 > 1. Since the latter contradicts the assumed c t yf 0 
or Cj 7^ 0, we therefore conclude that (40) is not feasible. Analogously, (41) is not feasible. Hence, 
we have shown that if (38) is feasible, the remaining cases (39) to (41) are not feasible. Since the 
arguments can be repeated using Table 4 when one of the remaining cases (39) to (41) is considered 
feasible, we finally conclude that at most one of the four cases is feasible, that is, we have shown 
our intermediate result. 

Let us now assume that none of the four cases yield a feasible solution. Then we obtain Table 
5, where in each row, at least one of the inequalities needs to be true. Let us assume that Tf < 0, 


Optimal Solution 

T\ 

T 2 

T 3 

Ta 

(38) not feasible 

< 0 

< 0 

- 

- 

(39) not feasible 

- 

- 

< 0 

< 0 

(40) not feasible 

- 

> 0 

> 0 

- 

(41) not feasible 

> 0 

- 

- 

> 0 


Table 5: Behavior of expressions Tf, T 2 , T3 and X4 when none of the optimal solutions is feasible 


then by Table 5, we conclude that we have following set of inequalities 


kcj < b 2 Ci , (52) 

kci > b- 2 Cj , (53) 

b\ Ci < kcj , (54) 

b\Cj > kci. (55) 

Combining (52) and (54) as well as (53) and (55), we obtain 

bid < b- 2 Ci, (56) 

b 2 Cj < b\Cj . (57) 


Now if Ci < 0, we find Cj < 0 by (52). Moreover (56) together with c* < 0 implies b 2 < b±, while 
(57) together with Cj < 0 implies 61 < b 2 , that is, we have found a contradiction. Analogously, we 
obtain a contradiction in the case T\ >0. As a consequence exactly one of the four cases produces 
a feasible solution. Finally, the implications are a direct consequence of the form of the solutions 
in (38) to (41) and the fact that only one case provides a feasible solution. □ 
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A Detailed Results of Experiments 

A.l Results for Different Working Set Selection Methods 








Figure 2: Train time (top) and corresponding ratio (bottom) of different data sets for different 
working set selection methods after fixing warm start initialization and stopping criteria 
with clipped duality gap. The graphs comprises of r = 0.25 (left), r = 0.50 (middle) and 
r = 0.75 (right). 
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Figure 3: Train iterations (top) and corresponding ratio (bottom) of different data sets for different 
working set selection methods after fixing warm start initialization and stopping criteria 
with clipped duality gap. The graphs comprises of r = 0.25 (left), r = 0.50 (middle) and 
r = 0.75 (right). 
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Figure 4: Average train time (left) and corresponding ratio (right) per grid point for different work¬ 
ing set selection strategies using clipped duality gap criteria and initializing solver with 
warm start for data set CAL-HOUSING. For WSS 2, 15 nearest neighbors are considered. 
The graphs comprises for r = 0.25 (top), r = 0.50 (middle) and r = 0.75 (bottom). 













Figure 5: Average number of iterations (left) and corresponding ratio (right) per grid point for 
different working set selection strategies using clipped duality gap criteria and initializing 
solver with warm start for data set cal-housing. For WSS 2, 15 nearest neighbors are 
considered. The graphs comprises for r = 0.25 (top), r = 0.50 (middle) and r = 
0.75 (bottom). 
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A.2 Results for Different Number of Nearest Neighbors 








Figure 6: Train time (top) and corresponding ratio (bottom) of different data sets for different 
number of nearest neighbors after fixing warm start initialization and stopping criteria 
with clipped duality gap. The graphs comprises of r = 0.25 (left), r = 0.50 (middle) and 
r = 0.75 (right). 
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Figure 7: Train iterations (top) and corresponding ratio (bottom) of different data sets for different 
number of nearest neighbors after fixing with warm start initialization and stopping 
criteria with clipped duality gap. The graphs comprises of r = 0.25 (left), r = 0.50 
(middle) and r = 0.75 (right). 
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Figure 8: Average train time (left) and corresponding ratio (right) per grid point for different 
different number of nearest neighbors considering warm start and stopping criteria with 
clipped duality gap for the data set CAL-HOUSING. The graphs comprises for t = 0.25 
(top), r = 0.50 (middle) and r = 0.75 (bottom). 
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Figure 9: Average number of iterations (left) and corresponding ratio (right) per grid point for 
different different number of nearest neighbors considering warm start and stopping cri¬ 
teria with clipped duality gap for the data set cal-HOUSING. The graphs comprises for 
r = 0.25 (top), r = 0.50 (middle) and r = 0.75 (bottom). 
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A.3 Results for Different Initialization Methods 



Figure 10: Train time (top) and corresponding ratio (bottom) of different data sets for different 
initialization methods after fixing stopping criteria with clipped duality gap and NN = 
15. The graphs comprises of r = 0.25 (left), r = 0.50 (middle) and r = 0.75 (right). 
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Figure 11: Train iterations (top) and corresponding ratio (bottom) of different data sets for different 
initialization methods after fixing stopping criteria with clipped duality gap and NN = 
15. The graphs comprises of r = 0.25 (left), r = 0.50 (middle) and r = 0.75 (right). 
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(left) and corresponding ratio (right) per grid point for different 
'ds considering stopping criteria after with clipped duality gap and 
set CAL-HOUSING. The graphs comprises for r = 0.25 (top), r = 0.50 
.75 (bottom). 
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Figure 13: Average train iterations (left) and corresponding ratio (right) per grid point for different 
initialization methods considering stopping criteria with clipped duality gap and WSS 
2 for the data set cal-housing. The graphs comprises for r = 0.25 (top), r = 0.50 
(middle) and r = 0.75 (bottom). 
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A.4 Results for two Different Stopping Criteria 




Figure 14: Train time (top) and corresponding ratio (bottom) of different data sets for different 
stopping criteria after fixing initialization method as warm start and NN = 15. The 
graphs comprises of r = 0.25 (left), r = 0.50 (middle) and r = 0.75 (right). 
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Figure 15: Train iterations (top) and corresponding ratio (bottom) of different data sets for different 
stopping criteria after fixing initialization method as warm start and NN = 15. The 
graphs comprises of r = 0.25 (left), r = 0.50 (middle) and r = 0.75 (right). 
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Figure 16: Average train time (left) and corresponding ratio (right) per grid point for different 
stopping criteria using 15 NN and initializing solver with warm start for CAL-HOUSING. 
The graphs comprises for t = 0.25 (top), r = 0.50 (middle) and r = 0.75 (bottom). 
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Figure 17: Average number of iterations (left) and corresponding ratio (right) per grid point for 
different stopping criteria using 15 NN and initializing solver with warm start for CAL- 
HOUSING. The graphs comprises for r = 0.25 (top), r = 0.50 (middle) and r = 0.75 
(bottom). 
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